Abstract: We consider the problem of estimating the total
probability of all symbols that appear with a given frequency 
in a string of i.i.d. random variables with unknown distribution. 
We focus on the regime in which the block length is large yet no 
symbol appears frequently in the string. This is accomplished by 
allowing the distribution to change with the block length. Under 
a natural convergence assumption on the sequence of underlying 
distributions, we show that the total probabilities converge to a 
deterministic limit, which we characterize. We then show that the 
Good-Turing total probability estimator is strongly consistent. 

I. Introduction 

The problem of estimating the underlying probability dis- 
tribution from an observed data sequence arises in a variety of 
fields such as compression, adaptive control, and linguistics. 
The most familiar technique is to use the empirical distribution 
of the data, also known as the type. This approach has 
a number of virtues. It is the maximum likelihood (ML) 
distribution, and if each symbol appears frequently in the 
string, then the law of large numbers guarantees that the 
estimate will be close to the true distribution. 

In some situations, however, not all symbols will appear 
frequently in the observed data. One example is a digital 
image with the pixels themselves, rather than bits, viewed as 
the symbols [1]. Here the size of the alphabet can meet or 
exceed the total number of observed symbols, i.e., the number 
of pixels in the image. Another example is English text. Even 
in large corpora, many words will appear once or twice or not 
at all [2]. This makes estimating the distribution of English 
words using the type ineffective. This problem is particularly 
pronounced when one attempts to estimate the distribution of 
bigrams, or pairs of words, since the number of bigrams is 
evidently the square of the number of words. 

To see that the empirical distribution is lacking as an 
estimator for the probabilities of uncommon symbols, con- 
sider the extreme situation in which the alphabet is infinite 
and we observe a length-n sequence containing n distinct 
symbols [3]. The ML estimator will assign probability 1/n 
to the n symbols that appear in the string and zero probability 
to the rest. But common sense suggests that the (n + 1)st
symbol in the sequence is very likely to be one that has not yet 
appeared. It seems that the ML estimator is overrating the data. 
Modifications to the ML estimator such as the Laplace "add 
one" and the Krichevsky-Trofimov "add half" [4] have been 
proposed as remedies, but these only alleviate the problem [3]. 
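The all-distinct example above can be made concrete with a short sketch. The function names and the finite `alphabet_size` are assumptions made for illustration only (the text considers an infinite alphabet, for which an add-constant rule is not directly defined).

```python
from collections import Counter
from fractions import Fraction

def ml_estimate(seq):
    """Empirical (maximum likelihood) probability of each observed symbol."""
    n = len(seq)
    return {s: Fraction(c, n) for s, c in Counter(seq).items()}

def add_constant_unseen_mass(seq, alphabet_size, c):
    """Total mass an add-c rule gives to symbols absent from seq.

    Assumes a finite alphabet of the given (hypothetical) size.
    """
    n = len(seq)
    seen = len(set(seq))
    c = Fraction(c)
    return c * (alphabet_size - seen) / (n + c * alphabet_size)

seq = list(range(10))                 # n = 10 observations, all distinct
ml = ml_estimate(seq)
ml_unseen = 1 - sum(ml.values())      # ML leaves no mass for unseen symbols
laplace_unseen = add_constant_unseen_mass(seq, alphabet_size=1000, c=1)
kt_unseen = add_constant_unseen_mass(seq, alphabet_size=1000, c=Fraction(1, 2))
```

Under these assumptions the ML estimate assigns 1/10 to each observed symbol and nothing to the rest, while the mass that the add-constant rules reserve for unseen symbols depends on the assumed alphabet size.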



In collaboration with Turing, Good [5] proposed an esti- 
mator for the probabilities of rare symbols that differs con- 
siderably from the ML estimator. The Good-Turing estimator 
has been shown to work well in practice [6], and it is now 
used in several application areas [3]. Early theoretical work 
on the estimator focused on its bias [5], [7], [8]. Recent 
work has been directed toward developing confidence intervals 
for the estimates using central limit theorems [9], [10] or 
concentration inequalities [11], [12]. Orlitsky, Santhanam, and 
Zhang [3] showed that the estimator has a pattern redundancy 
that is small but not optimal. None of these works, however, 
have shown that the estimator is strongly consistent. 

We show that the Good-Turing estimator is strongly consis- 
tent under a natural formulation of the problem. We consider 
the problem of estimating the total probability of all symbols 
that appear k times in the observed string for each nonnegative 
integer k. For k = 0, this is the total probability of the unseen 
symbols, a quantity that has received particular attention [7], 
[13]. Estimating the total probability of all symbols with the 
same empirical frequency is a natural approach when the 
symbols appear infrequently so that there is insufficient data to 
accurately estimate the probabilities of the individual symbols. 
Although the total probabilities are themselves random, we 
show that under our model they converge to a deterministic 
limit, which we characterize. Note that if the alphabet is small 
and the block length is large, then the problem effectively 
reduces to the usual probability estimation problem since it is 
unlikely that multiple symbols will have the same empirical 
frequency. 

It is known that the Good-Turing estimator performs poorly 
for high-probability symbols [3], but this is not a problem since 
the ML estimator can be employed to estimate the probabilities 
of symbols that appear frequently in the observed string. We 
therefore focus on the situation in which the symbols are 
unlikely, meaning that they have probability $O(1/n)$. We allow
the underlying distributions to vary with the block length 
n in order to maintain this condition, and we assume that, 
properly scaled, these distributions converge. This model is 
discussed in detail in the next section, where we also describe 
the Good-Turing estimator. In Section III we establish the
convergence of the total probabilities. Section IV uses this
convergence result to show strong consistency of the Good- 
Turing estimator. Some comments regarding how to estimate 
other quantities of interest are made in the final section. 



II. Preliminaries 

Let $(\Omega_n, \mathcal{F}_n, P_n)$ be a sequence of probability spaces. We
do not assume that $\Omega_n$ is finite or even countable. Our
observed string is a sequence of $n$ symbols drawn i.i.d. from
$\Omega_n$ according to $P_n$. Note that the alphabet and the underlying
distribution are permitted to vary with n. This allows us to 
model the situation in which the block length is large while 
the number of occurrences of some symbols is small. 

A. Total Probabilities 

For each nonnegative integer $k$, let $A_k^n$ denote the set of
symbols in $\Omega_n$ that appear exactly $k$ times in the string of
length $n$. We call

$$\xi_k^n := P_n(A_k^n)$$

the total probability of symbols that appear $k$ times.

Of course, for $k \ge 1$, $\xi_k^n$ is simply the sum of the
probabilities of the symbols with frequency $k$. On the other
hand, $A_0^n$ will be uncountable if $\Omega_n$ is.

We view $\xi^n$ as a random probability distribution on the
nonnegative integers. Our goal is to estimate this distribution.
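As a minimal sketch of this definition, the following computes the total probabilities for a known distribution on a finite alphabet. The function name `total_probabilities` and the three-symbol distribution are hypothetical choices for illustration; in the estimation problem itself the distribution is of course unknown.

```python
from collections import Counter
from fractions import Fraction

def total_probabilities(seq, probs):
    """Return {k: mass of symbols seen exactly k times in seq}.

    probs maps every symbol of a (hypothetical) finite alphabet to its
    true probability; symbols absent from seq contribute to k = 0.
    """
    counts = Counter(seq)
    totals = {}
    for symbol, p in probs.items():
        k = counts.get(symbol, 0)
        totals[k] = totals.get(k, 0) + p
    return totals

probs = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
xi = total_probabilities("cac", probs)
# "b" is unseen, "a" appears once, "c" appears twice
```

The values always sum to one, which is why the collection of total probabilities can be viewed as a (random) probability distribution on the nonnegative integers.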

B. The Good-Turing Estimator 

The Good-Turing estimator is normally viewed as an estimator
for the probabilities of the individual symbols. Let
$\varphi_k = |A_k^n|$ denote the number of symbols that appear exactly
$k$ times in the observed sequence. The basic Good-Turing
estimator assigns probability

$$\frac{(k+1)\,\varphi_{k+1}}{n\,\varphi_k}$$

to each symbol that appears $k \le n - 1$ times [5]. The case
$k = n$ must be handled separately, but this case is unimportant
to us since under our model it is unlikely that only one symbol
will appear in the string.

This formula can be naturally viewed as a total probability
estimator since the $\varphi_k$ in the denominator is merely dividing
the total probability equally among the $\varphi_k$ symbols that appear
$k$ times. Thus the Good-Turing total probability estimator
assigns probability

$$\zeta_k^n := \frac{(k+1)\,\varphi_{k+1}}{n}$$

to the aggregate of symbols that have appeared $k$ times for
each $k$ in $\{0, \ldots, n-1\}$. As a convention, we shall always
assign zero probability to the set of symbols that appear $n$
times:

$$\zeta_n^n := 0.$$

Like $\xi^n$, $\zeta^n$ is a random probability distribution on the
nonnegative integers.

As a total probability estimator, $\zeta^n$ is not ideal. For one
thing, $\zeta_k^n$ can be positive even when $A_k^n$ is empty, in which case
$\xi_k^n$ is clearly zero. A similar problem arises when estimating
the probabilities of individual symbols, and modifications to 
the basic Good-Turing estimator have been proposed to avoid 
it [5]. But we shall show that even the basic form of the Good- 
Turing estimator is strongly consistent for total probability 
estimation. 
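The total probability form of the estimator is simple to compute from the observed string alone. The following is a minimal sketch; the function name `good_turing_total` and the sample string are illustrative assumptions, not part of the paper.

```python
from collections import Counter
from fractions import Fraction

def good_turing_total(seq):
    """Good-Turing total probability estimates for k = 0, ..., n.

    Entry k is (k + 1) * phi[k + 1] / n, where phi[j] is the number of
    distinct symbols seen exactly j times; entry n is zero by convention.
    """
    n = len(seq)
    phi = Counter(Counter(seq).values())
    estimate = {k: Fraction((k + 1) * phi.get(k + 1, 0), n) for k in range(n)}
    estimate[n] = Fraction(0)
    return estimate

zeta = good_turing_total("abracadabra")
# counts: a=5, b=2, r=2, c=1, d=1, so phi[1]=2, phi[2]=2, phi[5]=1
```

Here the estimate for k = 4 is 5/11 even though no symbol appears exactly four times, which illustrates the shortcoming noted above; the estimates nevertheless sum to one.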



C. Shadows 

The distributions of the total probabilities and of the Good-Turing
estimator are unaffected if one relabels the symbols
in $\Omega_n$. This fact makes it convenient in what follows to
consider the probabilities assigned by $P_n$ without reference
to the labeling of the symbols.

Definition 1: Let $X_n$ be a random variable on $\Omega_n$ with
distribution $P_n$. The shadow of $P_n$ is defined to be the
distribution of the random variable $P_n(\{X_n\})$.

As an example, if $\Omega_n = \{a, b, c\}$ and

$$P_n(\{a\}) = P_n(\{b\}) = \tfrac{1}{2} P_n(\{c\}) = \tfrac{1}{4},$$

then the shadow of $P_n$ would be uniform over $\{1/4, 1/2\}$. If
$P_n$ is itself uniform, then its shadow is deterministic. Note
that the discrete entropy of a distribution only depends on the
distribution through its shadow. We will write $P_n(X_n)$ as a
shorthand for $P_n(\{X_n\})$ in what follows.

For finite alphabets, specifying the shadow is equivalent
to specifying the unordered components of $P_n$, viewed as a
probability vector. This is clearly seen in the above example,
since the shadow is uniformly distributed over $\{1/4, 1/2\}$ if
and only if the underlying distribution has two symbols with
probability 1/4 and one with probability 1/2.

If $P_n$ has a continuous component, then the shadow will
have a point mass at zero equal to the probability of this component.
The shadow reveals nothing more about the continuous
component than its total probability, but we shall have no need
for such information. Indeed, the distributions of both $\xi^n$ and
$\zeta^n$ depend on $P_n$ only through its shadow.
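For a finite alphabet the shadow can be computed directly. The sketch below uses hypothetical helper names (`shadow`, `entropy_from_shadow`) and reproduces the three-symbol example, including the remark that the entropy depends on the distribution only through its shadow.

```python
import math
from collections import defaultdict
from fractions import Fraction

def shadow(probs):
    """Law of P(X) when X ~ P: maps each probability value to its mass."""
    law = defaultdict(Fraction)
    for p in probs.values():
        law[p] += p          # a symbol of probability p is drawn w.p. p
    return dict(law)

def entropy_from_shadow(law):
    """Entropy in nats, computed as -E[log P(X)] from the shadow alone."""
    return -sum(float(m) * math.log(float(v)) for v, m in law.items() if v > 0)

probs = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
relabeled = {"x": Fraction(1, 2), "y": Fraction(1, 4), "z": Fraction(1, 4)}
sh = shadow(probs)           # uniform over {1/4, 1/2}
direct_entropy = -sum(float(p) * math.log(float(p)) for p in probs.values())
```

Relabeling the symbols leaves the shadow unchanged, which is the invariance the definition is meant to capture.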

D. Unlikely Symbols 

To prove strong consistency, we assume that the scaled
shadows, $n \cdot P_n(X_n)$, converge in distribution to a nonnegative
random variable $Y$ with distribution $Q$. This implies, in particular, that
asymptotically almost every symbol has probability $O(1/n)$
and therefore appears $O(1)$ times in the sequence on average.
As an example, if $P_n$ is a uniform distribution over an alphabet
of size $n$, then the scaled shadow, $n \cdot P_n(X_n)$, equals one
a.s. for each $n$ (and hence it converges in distribution). More
complicated examples can be constructed by quantizing a fixed
density more and more finely to generate the sequence of
distributions.
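The quantization construction can be sketched for one specific density. The choice f(x) = 2x on [0, 1], cut into n equal cells, is an assumption made purely for illustration, as are the function names. In this case the scaled cell probabilities stay below 2, and the mean of the scaled shadow equals (4n^2 - 1)/(3n^2), which tends to 4/3.

```python
from fractions import Fraction

def quantized_probs(n):
    """Cell masses when the density f(x) = 2x on [0, 1] is cut into n
    equal cells: cell i has mass ((i + 1)^2 - i^2) / n^2."""
    return [Fraction(2 * i + 1, n * n) for i in range(n)]

def scaled_shadow_mean(probs):
    """E[n * P_n(X_n)] for a distribution on n symbols (alphabet size
    taken equal to the block length n, as in this example)."""
    n = len(probs)
    return sum(n * p * p for p in probs)

n = 50
probs = quantized_probs(n)
mean_quantized = scaled_shadow_mean(probs)
mean_uniform = scaled_shadow_mean([Fraction(1, n)] * n)
```

For the uniform distribution the scaled shadow is identically one, so its mean is exactly one for every n.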

III. Total Probability Convergence 

Before considering the performance of the Good-Turing 
estimator, we study the asymptotics of the total probabilities 
themselves. Under our assumption that the scaled shadows 
converge, we show that the total probabilities converge almost 
surely to a deterministic Poisson mixture. 

Proposition 1: The random distribution $\xi^n$ converges to

$$\lambda_k := \int_0^\infty \frac{y^k e^{-y}}{k!} \, dQ(y), \quad k = 0, 1, 2, \ldots$$

in $L^1$ almost surely as $n \to \infty$.
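To make the Poisson mixture limit concrete, the following sketch evaluates it for a discrete mixing distribution. Restricting Q to finitely many atoms, and the name `poisson_mixture`, are assumptions for illustration. For the uniform example of the previous section, Q is a point mass at one and the limit is the Poisson(1) distribution.

```python
import math

def poisson_mixture(k, atoms):
    """Mixture of Poisson probabilities at k for a discrete mixing law.

    atoms is a list of (y, weight) pairs with weights summing to one;
    a finitely supported Q is an assumption made for illustration.
    """
    return sum(w * y**k * math.exp(-y) / math.factorial(k) for y, w in atoms)

point_mass_at_one = [(1.0, 1.0)]
two_atoms = [(0.5, 0.5), (2.0, 0.5)]
lam0 = poisson_mixture(0, point_mass_at_one)
total = sum(poisson_mixture(k, two_atoms) for k in range(60))
```

An atom at zero, which corresponds to a continuous component of the underlying distributions, contributes only to k = 0.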

We prove this result by first showing that the mean of $\xi^n$
converges to $\lambda$ and then proving concentration around the
mean. To show convergence of the mean, it is convenient to
make several definitions. Let

$$g_k^n(y) = \binom{n}{k} \left(\frac{y}{n}\right)^k \left(1 - \frac{y}{n}\right)^{n-k}$$

and

$$g_k(y) = \frac{y^k e^{-y}}{k!}.$$

Since

$$\left(1 + \frac{y_n}{n}\right)^n \to \exp(y) \quad \text{if } y_n \to y,$$

it follows that for all sequences $y_n \to y$, $g_k^n(y_n) \to g_k(y)$.
Note also that $g_k^n(y) \le 1$ if $0 \le y \le n$ by the binomial
theorem. Let

$$C^n = \{\omega \in \Omega_n : P_n(\omega) > 0\}$$

and note that $C^n$ is countable for each $n$.
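The pointwise convergence used here is the familiar binomial-to-Poisson limit, and it can be checked numerically. The sketch assumes that the finite-n function is the binomial probability of k successes in n trials with success probability y/n and that its limit is the Poisson probability of k with mean y; the names `g_n` and `g` are illustrative.

```python
import math

def g_n(n, k, y):
    """Binomial(n, y/n) probability of exactly k successes, 0 <= y <= n."""
    return math.comb(n, k) * (y / n) ** k * (1 - y / n) ** (n - k)

def g(k, y):
    """Poisson(y) probability of exactly k."""
    return y**k * math.exp(-y) / math.factorial(k)

k, y = 3, 2.0
err_small = abs(g_n(50, k, y) - g(k, y))
err_large = abs(g_n(10**6, k, y) - g(k, y))
```

The gap shrinks as n grows, and both functions are bounded by one because they are probabilities.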
Lemma 1: For all nonnegative integers $k$,

$$\lim_{n \to \infty} E[\xi_k^n] = \lambda_k.$$

Proof: We shall show that

$$E[\xi_k^n] = E[g_k^n(n P_n(X_n))]. \tag{1}$$

First consider the case $k \ge 1$. Here

$$\xi_k^n = P_n(A_k^n \cap C^n) = \sum_{\omega \in C^n} 1(\omega \in A_k^n) P_n(\omega),$$

so by monotone convergence

$$E[\xi_k^n] = \sum_{\omega \in C^n} g_k^n(n P_n(\omega)) P_n(\omega)$$
$$= E[g_k^n(n P_n(X_n)) 1(X_n \in C^n)]$$
$$= E[g_k^n(n P_n(X_n))].$$

Next consider the case $k = 0$. Here

$$\xi_0^n = P_n(A_0^n \cap C^n) + P_n(A_0^n - C^n)$$
$$= \sum_{\omega \in C^n} 1(\omega \in A_0^n) P_n(\omega) + P_n(\Omega_n - C^n).$$

So again by monotone convergence,

$$E[\xi_0^n] = \sum_{\omega \in C^n} g_0^n(n P_n(\omega)) P_n(\omega) + P_n(\Omega_n - C^n)$$
$$= E[g_0^n(n P_n(X_n)) 1(X_n \in C^n)] + E[g_0^n(n P_n(X_n)) 1(X_n \notin C^n)]$$
$$= E[g_0^n(n P_n(X_n))].$$

This establishes (1). Since $n P_n(X_n)$ converges in distribution
to $Y$, we can create a sequence of random variables $\{Y_n\}_{n=1}^\infty$
such that $Y_n$ has the same distribution as $n P_n(X_n)$ and $Y_n$
converges to $Y$ almost surely [14, Theorem 4.30]. Then

$$g_k^n(Y_n) \to g_k(Y) \quad \text{a.s.}$$

Since $g_k^n(Y_n) \le 1$ a.s., the bounded convergence theorem
implies

$$\lim_{n \to \infty} E[g_k^n(Y_n)] = E[g_k(Y)] = \int_0^\infty g_k(y) \, dQ(y) = \lambda_k.$$

□

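Lemma 2: For all nonnegative integers $k$,

$$\lim_{n \to \infty} |\xi_k^n - E[\xi_k^n]| = 0 \quad \text{a.s.}$$

Proof: Let

$$B^n = \left\{\omega \in \Omega_n : P_n(\omega) > \frac{1}{n^{3/4}}\right\}$$

and note that $|B^n| \le n^{3/4}$. Then let

$$\tilde{\xi}_k^n = P_n(A_k^n \cap B^n),$$

and note that

$$|\xi_k^n - E[\xi_k^n]| \le \left|(\xi_k^n - \tilde{\xi}_k^n) - E[\xi_k^n - \tilde{\xi}_k^n]\right| + \tilde{\xi}_k^n + E[\tilde{\xi}_k^n].$$

Now if we change one symbol in the underlying sequence,
then $\xi_k^n - \tilde{\xi}_k^n$ can change by at most $2/n^{3/4}$. By
the Azuma-Hoeffding-Bennett concentration inequality [15,
Corollary 2.4.14], it follows that for all $\epsilon > 0$,

$$\Pr\left(\left|(\xi_k^n - \tilde{\xi}_k^n) - E[\xi_k^n - \tilde{\xi}_k^n]\right| > \epsilon\right) \le 2 \exp\left(-\frac{\epsilon^2 \sqrt{n}}{8}\right).$$

Since the right-hand side is summable over $n$, this implies that

$$(\xi_k^n - \tilde{\xi}_k^n) - E[\xi_k^n - \tilde{\xi}_k^n] \to 0 \quad \text{a.s.}$$

Now

$$\tilde{\xi}_k^n = \sum_{\omega \in B^n} P_n(\omega) 1(\omega \in A_k^n),$$

so

$$E[\tilde{\xi}_k^n] = \sum_{\omega \in B^n} P_n(\omega) \binom{n}{k} (P_n(\omega))^k (1 - P_n(\omega))^{n-k}$$
$$\le \sum_{\omega \in B^n} \binom{n}{k} (P_n(\omega))^k (1 - P_n(\omega))^{n-k}.$$

But

$$(P_n(\omega))^k (1 - P_n(\omega))^{n-k} = \exp\left(-n\left[H\left(\frac{k}{n}\right) + D\left(\frac{k}{n} \,\Big\|\, P_n(\omega)\right)\right]\right),$$

where $H(\cdot)$ denotes the binary entropy function and $D(\cdot \| \cdot)$
denotes binary Kullback-Leibler divergence, both with natural
logarithms [16, Theorem 12.1.2]. For all sufficiently large $n$,
$k/n < 1/n^{3/4}$, which implies that for all $\omega \in B^n$,

$$D\left(\frac{k}{n} \,\Big\|\, P_n(\omega)\right) \ge D\left(\frac{k}{n} \,\Big\|\, \frac{1}{n^{3/4}}\right).$$

This gives

$$(P_n(\omega))^k (1 - P_n(\omega))^{n-k} \le \left(\frac{1}{n^{3/4}}\right)^k \left(1 - \frac{1}{n^{3/4}}\right)^{n-k}.$$

Since $|B^n| \le n^{3/4}$ and $\binom{n}{k} \le n^k / k!$, this implies

$$E[\tilde{\xi}_k^n] \le \frac{n^{(k+3)/4}}{k!} \left(1 - \frac{1}{n^{3/4}}\right)^{n-k}. \tag{2}$$

Now the right-hand side tends to zero as $n \to \infty$, so

$$\lim_{n \to \infty} E[\tilde{\xi}_k^n] = 0.$$

In fact, the right-hand side of (2) is summable over $n$. By
Markov's inequality,

$$\Pr(\tilde{\xi}_k^n > \epsilon) \le \frac{E[\tilde{\xi}_k^n]}{\epsilon},$$

this implies that $\tilde{\xi}_k^n \to 0$ a.s. The conclusion follows. □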
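The bound on the mean contribution of the high-probability symbols decays very quickly, which a short numerical sketch makes visible. The sketch assumes the bound has the form n^((k+3)/4) / k! times (1 - n^(-3/4))^(n-k); the function name `mean_bound` is illustrative, and the computation is done in the log domain to avoid underflow.

```python
import math

def mean_bound(n, k):
    """Evaluate n^((k+3)/4) / k! * (1 - n^(-3/4))^(n - k) via logarithms."""
    log_value = (
        ((k + 3) / 4) * math.log(n)
        - math.lgamma(k + 1)                      # log(k!)
        + (n - k) * math.log1p(-(n ** -0.75))     # log of the last factor
    )
    return math.exp(log_value)

bounds = [mean_bound(n, 1) for n in (10**4, 10**6, 10**8)]
```

The last factor behaves like exp(-n^(1/4)), which eventually overwhelms the polynomial prefactor for any fixed k.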
Proof of Proposition 1: It follows from Lemmas 1 and 2
that for each $k$,

$$\lim_{n \to \infty} \xi_k^n = \lambda_k \quad \text{a.s.}$$

That is, $\xi^n$ converges pointwise to $\lambda$ with probability one.
The strengthening to $L^1$ convergence follows from Scheffé's
theorem [17, Theorem 16.12], but we shall give a self-contained
proof since it is brief. Observe that with probability
one,

$$0 = \sum_{k=0}^\infty (\lambda_k - \xi_k^n) = \sum_{k=0}^\infty [\lambda_k - \xi_k^n]^+ - \sum_{k=0}^\infty [\lambda_k - \xi_k^n]^-,$$

where $[\cdot]^+$ and $[\cdot]^-$ represent the positive and negative parts,
respectively. Thus

$$\sum_{k=0}^\infty |\lambda_k - \xi_k^n| = 2 \sum_{k=0}^\infty [\lambda_k - \xi_k^n]^+ \quad \text{a.s.}$$

But $[\lambda_k - \xi_k^n]^+$ converges pointwise to 0 a.s. and is less than or
equal to $\lambda_k$. The dominated convergence theorem then implies
that

$$\lim_{n \to \infty} \sum_{k=0}^\infty [\lambda_k - \xi_k^n]^+ = 0 \quad \text{a.s.}$$

□
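The identity behind this argument, that the L1 distance between two probability distributions is twice the sum of the positive parts of their differences, can be checked on a small example. The vectors `p` and `q` below are arbitrary illustrative choices.

```python
from fractions import Fraction

def l1_and_twice_positive_part(p, q):
    """Return (sum |p_k - q_k|, 2 * sum [p_k - q_k]^+) for two vectors."""
    l1 = sum(abs(a - b) for a, b in zip(p, q))
    positive = sum(max(a - b, 0) for a, b in zip(p, q))
    return l1, 2 * positive

p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
q = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
l1, twice_pos = l1_and_twice_positive_part(p, q)
```

The two quantities agree because both vectors sum to one, so the positive and negative parts of the differences carry equal total mass.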



IV. Strong Consistency 

The key to showing strong consistency is to establish
a convergence result for the Good-Turing estimator that is
analogous to Proposition 1 for the total probabilities.

Proposition 2: The random distribution $\zeta^n$ converges to $\lambda$
in $L^1$ almost surely as $n \to \infty$.

The desired strong consistency follows from this result and
Proposition 1.

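Theorem 1: The Good-Turing total probability estimator is
strongly consistent, i.e.,

$$\lim_{n \to \infty} \sum_{k=0}^{n} |\xi_k^n - \zeta_k^n| = 0 \quad \text{a.s.}$$

Proof: We have

$$\sum_{k=0}^{n} |\xi_k^n - \zeta_k^n| \le \sum_{k=0}^{\infty} |\xi_k^n - \lambda_k| + \sum_{k=0}^{\infty} |\lambda_k - \zeta_k^n|.$$

We now let $n \to \infty$ and invoke Propositions 1 and 2. □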
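A small simulation gives a feel for the consistency statement. This is only a sketch under illustrative assumptions: the distribution is uniform over an alphabet of size n (so every symbol has probability 1/n), a fixed seed is used, and the function name `l1_gap_uniform` is hypothetical. It returns the L1 distance between the true total probabilities and the Good-Turing estimates for one simulated string.

```python
import random
from collections import Counter

def l1_gap_uniform(n, seed=0):
    """L1 distance between total probabilities and Good-Turing estimates
    for n i.i.d. draws from the uniform law on n symbols."""
    rng = random.Random(seed)
    counts = Counter(rng.randrange(n) for _ in range(n))
    phi = Counter(counts.values())   # phi[k] = number of symbols seen k times
    phi[0] = n - len(counts)         # symbols never drawn
    gap = 0.0
    for k in range(n + 1):
        true_total = phi[k] / n      # each symbol has probability 1/n
        estimate = (k + 1) * phi[k + 1] / n if k < n else 0.0
        gap += abs(true_total - estimate)
    return gap

gap = l1_gap_uniform(20000)
```

A single run does not demonstrate almost sure convergence, but the gap is typically small at this block length.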
The proof of Proposition 2 parallels that of Proposition 1 in
the previous section. In particular, we first show that the mean
of $\zeta^n$ converges to $\lambda$ and then establish concentration around
the mean.

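Lemma 3: For all nonnegative integers $k$,

$$\lim_{n \to \infty} E[\zeta_k^n] = \lambda_k.$$

Proof: We shall show that

$$E[\zeta_k^n] = E[g_k^{n-1}((n-1) P_n(X_n))]. \tag{3}$$

First consider the case $k \ge 1$. Here, almost surely,

$$\zeta_k^n = \frac{k+1}{n} |A_{k+1}^n \cap C^n| = \frac{k+1}{n} \sum_{\omega \in C^n} 1(\omega \in A_{k+1}^n).$$

So by monotone convergence,

$$E[\zeta_k^n] = \sum_{\omega \in C^n} \frac{k+1}{n} \binom{n}{k+1} (P_n(\omega))^{k+1} (1 - P_n(\omega))^{n-k-1}$$
$$= \sum_{\omega \in C^n} \binom{n-1}{k} (P_n(\omega))^k (1 - P_n(\omega))^{n-k-1} P_n(\omega)$$
$$= E[g_k^{n-1}((n-1) P_n(X_n)) 1(X_n \in C^n)]$$
$$= E[g_k^{n-1}((n-1) P_n(X_n))].$$

Next consider the case $k = 0$. Here

$$\zeta_0^n = \frac{1}{n} |A_1^n| = \frac{1}{n} |A_1^n \cap C^n| + \frac{1}{n} |A_1^n - C^n|.$$

Again invoking monotone convergence,

$$E[\zeta_0^n] = \sum_{\omega \in C^n} (1 - P_n(\omega))^{n-1} P_n(\omega) + P_n(\Omega_n - C^n)$$
$$= \sum_{\omega \in C^n} g_0^{n-1}((n-1) P_n(\omega)) P_n(\omega) + P_n(\Omega_n - C^n)$$
$$= E[g_0^{n-1}((n-1) P_n(X_n)) 1(X_n \in C^n)] + E[g_0^{n-1}((n-1) P_n(X_n)) 1(X_n \notin C^n)]$$
$$= E[g_0^{n-1}((n-1) P_n(X_n))].$$

This establishes (3). Following the reasoning in the proof of
Lemma 1, this implies

$$\lim_{n \to \infty} E[\zeta_k^n] = E[g_k(Y)] = \lambda_k$$

for all $k$. □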
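The exact expression for the mean of the Good-Turing estimate can be verified by brute force on a tiny example. The sketch below enumerates every length-n string over a three-symbol alphabet and compares the resulting expectation with the closed form, namely the sum over symbols of P(w) times the Binomial(n - 1, P(w)) probability of k. The distribution and function names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

def expected_estimate_bruteforce(probs, n, k):
    """E[(k + 1) * phi_{k+1} / n] by enumerating all length-n strings."""
    total = Fraction(0)
    for seq in product(range(len(probs)), repeat=n):
        weight = Fraction(1)
        for s in seq:
            weight *= probs[s]
        phi = Counter(Counter(seq).values())
        total += weight * Fraction((k + 1) * phi.get(k + 1, 0), n)
    return total

def expected_estimate_formula(probs, n, k):
    """Sum over symbols of P(w) * C(n-1, k) P(w)^k (1 - P(w))^(n-1-k)."""
    return sum(p * comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k) for p in probs)

probs = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
n = 4
pairs = [(expected_estimate_bruteforce(probs, n, k),
          expected_estimate_formula(probs, n, k)) for k in range(n)]
```

Exact rational arithmetic is used so that the two sides can be compared for equality rather than approximate agreement.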

Lemma 4: For all nonnegative integers $k$,

$$\lim_{n \to \infty} |\zeta_k^n - E[\zeta_k^n]| = 0 \quad \text{a.s.}$$

Proof: Observe that if we alter one symbol in the underlying
i.i.d. sequence, then $\zeta_k^n$ will change by at most
$2(k+1)/n$. As in the proof of Lemma 2, the Azuma-Hoeffding-Bennett
concentration inequality [15, Corollary 2.4.14] then
implies that

$$\Pr(|\zeta_k^n - E[\zeta_k^n]| > \epsilon) \le 2 \exp\left(-\frac{\epsilon^2 n}{8(k+1)^2}\right).$$

Since the right-hand side is summable over $n$, the conclusion
follows. □
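The bounded-difference property used in this proof can be checked empirically: altering one position of the string changes the counts of at most two symbols, so the number of symbols seen exactly k + 1 times moves by at most two. The sketch below searches random strings for the largest observed change relative to 2(k + 1)/n; the parameters and function names are arbitrary illustrative choices.

```python
import random
from collections import Counter
from fractions import Fraction

def estimate(seq, k):
    """Good-Turing total probability estimate (k + 1) * phi_{k+1} / n."""
    n = len(seq)
    phi = Counter(Counter(seq).values())
    return Fraction((k + 1) * phi.get(k + 1, 0), n)

def worst_ratio(trials=200, n=30, alphabet=12, seed=1):
    """Largest observed |change| / (2 (k + 1) / n) over random one-symbol edits."""
    rng = random.Random(seed)
    worst = Fraction(0)
    for _ in range(trials):
        seq = [rng.randrange(alphabet) for _ in range(n)]
        alt = list(seq)
        alt[rng.randrange(n)] = rng.randrange(alphabet)   # alter one symbol
        for k in range(n):
            change = abs(estimate(seq, k) - estimate(alt, k))
            worst = max(worst, change / Fraction(2 * (k + 1), n))
    return worst

ratio = worst_ratio()
```

A ratio of at most one in every trial is consistent with the stated bound, although a random search is of course not a proof.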
Proof of Proposition 2: The result follows from Lemma 3,
Lemma 4, and Scheffé's theorem [17, Theorem 16.12] as in
the proof of Proposition 1. □

V. Shadow Estimation 

Proposition 1 shows that the total probabilities converge to
a deterministic limit, which is a function of the limit of the
scaled shadows, $Q$. In fact, the total probabilities converge to
a Poisson mixture, with $Q$ being the mixing distribution. The
functional form of the Poisson distribution enables us to create
a simple function of the observed string, the Good-Turing
estimator, that has the same limit as the total probabilities. In
particular, we can consistently estimate the total probabilities
without having to explicitly estimate $Q$.

In general, such a shortcut might not be available. It is of 
interest therefore to study how to estimate Q itself from the 
observed string. With an estimator for Q, one could create a 
"plug-in" estimator for other quantities of interest. 